This is an attempt at using a non-linear measure of correlation to identify similar trend in a time series. The idea and code is heavily influenced by this article. Additional ideas (not yet implemented) are influenced by Rob Hyndman’s workshop at the RStudio::conf 2020.
Distance correlation is a type of correlation that can detect non-linear and non-monotonic correlations. I became aware of distance correlation from a comment to an article discussing problems with a another non-linear, non-monotonic correlation measure.
Let’s try this method on Apple stock indeces. First, we download Apple’s data from Yahoo Finance using the quantmod package:
## Load Suncor stock
cenovus <- getSymbols("CVE", src = 'yahoo', from = '2015-01-01', auto.assign = F) %>%
as.data.frame() %>%
rownames_to_column(var = "date") %>%
mutate(date = ymd(date)) %>%
select(date, CVE.Open)
cenovus %>%
slice(1:10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = TRUE, font_size = 12)| date | CVE.Open |
|---|---|
| 2015-01-02 | 20.420000 |
| 2015-01-05 | 20.280001 |
| 2015-01-06 | 19.420000 |
| 2015-01-07 | 19.799999 |
| 2015-01-08 | 19.639999 |
| 2015-01-09 | 19.910000 |
| 2015-01-12 | 19.690001 |
| 2015-01-13 | 19.080000 |
| 2015-01-14 | 18.690001 |
| 2015-01-15 | 19.930000 |
Now let’s try to identify a pattern of interest, for instance, the rapid downspike between 2015-08-11 and 2014-09-01:
## Identify pattern
pattern <- cenovus %>%
filter(between(date, as.Date("2019-04-04"), as.Date("2019-05-06")))
(cenovus %>%
ggplot(aes(x = date, y = CVE.Open)) +
geom_line() +
geom_line(data = pattern, color = "red") +
labs(x = "") +
scale_x_date(breaks = "6 months", date_labels = "%Y-%m" )) %>%
ggplotly()Let’s calculate the distance correlation measure and plot the top 3:
## Calculate distance
n <- nrow(pattern)
df2 <- cenovus %>%
anti_join(pattern, by = "date") %>% ##remove pattern
group_by(grp = as.integer(gl(n(), n, n()))) %>% ##create grouping factors
filter(n() == n) %>% ##removes windows of unequal number (usual last)
mutate(cor1 = DCOR(scale(pattern$CVE.Open), scale(CVE.Open))$dCor) %>%
ungroup()
## Plot
df3 <- df2 %>%
mutate(rank1 = dense_rank(desc(cor1)))
(ggplot() +
## Plot full series
geom_line(data = cenovus, aes(x = date, y = CVE.Open), color = "grey") +
## Plot pattern
geom_line(data = pattern, aes(x = date, y = CVE.Open), color = "red", size = 0.8) +
## Plot closest pattern by corr.x
geom_line(data = filter(df3, rank1<=3), aes(x = date, y = CVE.Open, group = grp), color = "purple") +
scale_x_date(breaks = "12 months") +
labs(x = "",
title = "Cenovus. Red: Pattern, Purple: top 3 matches")) %>%
ggplotly()The top 3 correlations are pretty close in shape. Notice that this approach cannot distinguish symmetry which is quite important for many applications. See some of the untested improvements ideas below that may help address this.
Below are some improvements ideas based on creating several statistics and using principal components to reduce the dimensionality:
df2 <- df %>%
mutate(cor1 = Hmisc::spearman2(scale(pattern$var), scale(var))[[1]]) %>%
mutate(cor4 = Hmisc::hoeffd(scale(pattern$var), scale(var))$P[[1,2]]) %>%
mutate(entropy = entropy(var) - entropy(pattern$var)) %>%
mutate(flat = longest_flat_spot(var) - longest_flat_spot(pattern$var)) %>%
mutate(cross = n_crossing_points(var) - n_crossing_points(pattern$var)) %>%
mutate(lumpiness = lumpiness(var) - lumpiness(pattern$var)) %>%
mutate(hurst = hurst(var) - hurst(pattern$var)) %>%
mutate(nonlin = nonlinearity(var) - nonlinearity(pattern$var)) %>%
## Principal component or similar
df.pc <- df2 %>%
select(-date, -var, -grp, -lumpiness)
prc <- prcomp(df.pc, scale = TRUE)
df2 <- df2 %>%
bind_cols(as.data.frame(prc$x))
Last updated: January 29, 2021